Travel Package Purchase Prediction

Problem Statement

Objective

Data Dictionary

Tourism.csv - raw data that is used in this project.

  1. CustomerID: Unique customer ID
  2. ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
  3. Age: Age of customer
  4. TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
  5. CityTier: City tier, based on the city's development, population, facilities, and living standards. The categories are ordered, i.e. Tier 1 > Tier 2 > Tier 3
  6. Occupation: Occupation of customer
  7. Gender: Gender of customer
  8. NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
  9. PreferredPropertyStar: Preferred hotel property rating by customer
  10. MaritalStatus: Marital status of customer
  11. NumberOfTrips: Average number of trips in a year by customer
  12. Passport: Whether the customer has a passport or not (0: No, 1: Yes)
  13. OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
  14. NumberOfChildrenVisiting: Total number of children under the age of 5 planning to take the trip with the customer
  15. Designation: Designation of the customer in the current organization
  16. MonthlyIncome: Gross monthly income of the customer
  17. PitchSatisfactionScore: Sales pitch satisfaction score
  18. ProductPitched: Product pitched by the salesperson
  19. NumberOfFollowups: Total number of follow-ups made by the salesperson after the sales pitch
  20. DurationOfPitch: Duration of the pitch by a salesperson to the customer

Import the necessary packages

Load the dataset

Data Structure and Overview

Shape of the data

Check the types of data columns

Convert object to categorical type

Remove CustomerID

Check if there are missing values

Check if there are duplicates

Drop the duplicates

Check the summary stats for the dataset

Univariate Data Analysis

Observation on Age

Observation on CityTier

Observation on DurationOfPitch

Observation on NumberOfPersonVisiting

Observation on NumberOfFollowups

Observation on PreferredPropertyStar

Observation on NumberOfTrips

Observation on Passport

Observation on PitchSatisfactionScore

Observation on OwnCar

Observation on NumberOfChildrenVisiting

Observation on MonthlyIncome

Observations on non-numerical variables

Observation on ProdTaken

Observation on TypeofContact

Observation on Occupation

Observation on Gender

Observation on ProductPitched

Observation on MaritalStatus

Observation on Designation

Bivariate Data Analysis

ProdTaken vs Age

ProdTaken vs TypeofContact

ProdTaken vs CityTier

ProdTaken vs DurationOfPitch

ProdTaken vs Occupation

ProdTaken vs Gender

ProdTaken vs NumberOfPersonVisiting

ProdTaken vs NumberOfFollowups

ProdTaken vs ProductPitched

ProdTaken vs PreferredPropertyStar

ProdTaken vs MaritalStatus

ProdTaken vs NumberOfTrips

ProdTaken vs Passport

ProdTaken vs PitchSatisfactionScore

ProdTaken vs OwnCar

ProdTaken vs NumberOfChildrenVisiting

ProdTaken vs Designation

ProdTaken vs MonthlyIncome

Summary of EDA

Data Description

Univariate Data Analysis

Bivariate Data Analysis

Data Pre-Processing

Missing Value Treatment

Outliers Treatment

Fix Gender Column

Summary of Data Pre-processing

Missing Value Treatment

Outliers Treatment

Fix Gender Column

Model Building

Model Evaluation Criterion

Model can make wrong predictions as:

  1. Predicting that a customer is likely to purchase the newly introduced travel package when they actually do not want to buy it (false positive).
  2. Predicting that a customer does not want to purchase the travel package when they actually do (false negative).
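
The two error cases above correspond to false positives and false negatives in a confusion matrix. The following is a minimal sketch in plain Python of how they can be counted and turned into recall and precision. The helper names (`confusion_counts`, `recall`, `precision`) and the toy labels are assumptions made for illustration and are not taken from the notebook; 1 stands for "purchased" as in the ProdTaken column.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels where 1 = package purchased."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # correctly predicted buyer
        elif predicted == 1 and actual == 0:
            fp += 1   # error case 1: predicted buyer, did not buy
        elif predicted == 0 and actual == 1:
            fn += 1   # error case 2: predicted non-buyer, actually bought
        else:
            tn += 1   # correctly predicted non-buyer
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of actual buyers the model found; penalizes error case 2."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted buyers who really bought; penalizes error case 1."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical labels, made up for this example only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("recall:", recall(tp, fn))
print("precision:", round(precision(tp, fp), 3))
```

Recall falls as false negatives rise and precision falls as false positives rise, so the choice between them depends on which of the two error cases is judged more costly.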

Which case is more important?

Which metric to optimize?

Define dependent variable

Create dummy variables

Split data into training and testing set

Create functions to calculate different metrics and confusion matrix

Build model - Bagging

Model Building 1 - Bagging with default hyperparameters

Check Performance on Training

Check Performance on Testing

Model Building 2 - Random Forest with default hyperparameters

Check Performance on Training

Check Performance on Testing

Model Building 3 - Decision Tree with default hyperparameters

Check Performance on Training

Check Performance on Testing

Visualizing Decision Tree

Summary on Model Buildings with default hyperparameters

Bagging Classifier

Random Forest Classifier

Decision Tree Classifier

Model performance improvement - Tuning models

Tuning Bagging Classifier with GridSearch

Check Performance on Training

Check Performance on Testing

Tuning Bagging Classifier with weighted base estimator

Check Performance on Training

Check Performance on Testing

Tuning Random Forest classifier with GridSearch

Check Performance on Training

Check Performance on Testing

Tuning Random Forest classifier with class_weights

Check Performance on Training

Check Performance on Testing

Tuning Decision Tree Classifier - Pre-pruning with GridSearch

Check Performance on Training

Check Performance on Testing

Visualizing Tuned Decision Tree with GridSearch

Tuning Decision Tree - Post-pruning with Cost Complexity Analysis

Recall vs alpha for training and testing sets

Check Performance on Training

Check Performance on Testing

Visualizing Decision Tree Post-pruning Cost Complexity

Summary on Model performance improvements

Bagging Classifier

Tuning Bagging Classifier with GridSearch

Tuning Bagging Classifier with weighted Decision Tree base estimator

Random Forest classifier

Tuning Random Forest classifier with GridSearch

Tuning Random Forest classifier with class_weights

Decision Tree Classifier

Tuning Decision Tree Classifier - Pre-pruning with GridSearch

Tuning Decision Tree - Post-pruning with Cost Complexity Analysis

---- Choose the best model ----

Boosting Models with default hyperparameters

Boosting model 1 - AdaBoost Classifier

Check Performance on Training

Check Performance on Testing

Boosting model 2 - Gradient Boosting Classifier

Check Performance on Training

Check Performance on Testing

Boosting model 3 - XGBoost Classifier

Check Performance on Training

Check Performance on Testing

Boosting model 4 - Stacking model with default hyperparameters

Check Performance on Training

Check Performance on Testing

Summary on Boosting Model Buildings with default hyperparameters

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier

Stacking Classifier

Model performance improvement - Tuning Boosting models

Tuning Model 1 - AdaBoost Classifier with GridSearch hyperparameters

Check Performance on Training

Check Performance on Testing

Check feature importance

Tuning Model 2 - Gradient Boosting Classifier with GridSearch hyperparameters

Check Performance on Training

Check Performance on Testing

Check feature importance

Tuning Model 3 - XGBoost Classifier with GridSearch hyperparameters

Check Performance on Training

Check Performance on Testing

Check feature importance

Tuning Model 4 - Stacking Model with tuned models

Check Performance on Training

Check Performance on Testing

Summary on Boosting Model Buildings with hyperparameters tuning

AdaBoost Tuned with GridSearch

Gradient Boosting Tuned with GridSearch

XGBoost Tuned with GridSearch

Stacking Model Tuned with GridSearch

Comparing all models

Training models comparison

Testing models comparison

Conclusion

Business Recommendations